Infer filesystem from path when writing a partitioned DataFrame to remote file systems using pyarrow #34842
Conversation
Thanks. We'll need tests and a whatsnew for this.
pandas/io/parquet.py (Outdated)
@@ -104,19 +104,26 @@ def write(
        from_pandas_kwargs["preserve_index"] = index

        table = self.api.Table.from_pandas(df, **from_pandas_kwargs)

        fs = get_fs_for_path(path)
This feels like it should be done at a higher level in the base class and passed through.
Call it `filesystem`.
`FastParquetImpl` doesn't require this, as it uses `get_file_and_filesystem` to obtain the filesystem in the `read` method. `fs` is used in the `read` method for `PyArrowImpl`. To be consistent with it, `fs` is a better choice.
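For background, the kind of path-based dispatch a helper like `get_fs_for_path` performs can be sketched with the standard library alone. This is a hedged illustration: `infer_fs_scheme` is a hypothetical name, and the real pandas helper returns an actual filesystem object (e.g. from s3fs) rather than a scheme string.

```python
from typing import Optional
from urllib.parse import urlparse

def infer_fs_scheme(path: str) -> Optional[str]:
    """Return the remote scheme ("s3", "gcs", ...) or None for a local path.

    A minimal sketch: a remote URL scheme would select a matching
    fsspec-style filesystem; anything else is treated as a local path.
    """
    scheme = urlparse(path).scheme
    # Windows drive letters ("C:\\data") parse as a one-letter scheme,
    # so require at least two characters to count as a real protocol.
    return scheme if len(scheme) > 1 else None

print(infer_fs_scheme("s3://bucket/key/data.parquet"))  # s3
print(infer_fs_scheme("/tmp/data.parquet"))             # None
```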
        # write_to_dataset does not support a file-like object when
        # a directory path is used, so just pass the path string.
        if partition_cols is not None:
            self.api.parquet.write_to_dataset(
                table,
                path,
                compression=compression,
                filesystem=fs,
What if the user provides a `filesystem`?
Thanks for catching this. I realised this after looking at the tests that failed.

Perhaps `filesystem` should be one of the keyword arguments instead of getting it from `kwargs`?

I think we also need to update the docs to mention that `filesystem` is needed when the system can't obtain the underlying credentials from e.g. `~/.aws/credentials`.
@@ -568,6 +568,24 @@ def test_s3_roundtrip_for_dir(self, df_compat, s3_resource, pa, partition_col):
            repeat=1,
        )

    @td.skip_if_no("s3fs")
    @pytest.mark.parametrize("partition_col", [["A"], []])
    def test_s3_roundtrip_for_dir_infer_fs(
This looks identical to https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_parquet.py - could we parametrize instead here?
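A hedged sketch of the suggested deduplication: instead of copying the test body into a second function, fold the inferred-filesystem case into the existing test via a second `parametrize` axis (names and the flag are illustrative):

```python
import pytest

@pytest.mark.parametrize("partition_col", [["A"], []])
@pytest.mark.parametrize("pass_filesystem", [True, False])
def test_s3_roundtrip_for_dir(partition_col, pass_filesystem):
    # The real body would build kwargs = {"filesystem": fs} only when
    # pass_filesystem is True, exercising path-based inference otherwise.
    pass

# The stacked decorators register two parametrize marks, so pytest
# collects the cross product: 2 x 2 = 4 test cases.
marks = [m for m in test_s3_roundtrip_for_dir.pytestmark if m.name == "parametrize"]
print(len(marks))  # 2
```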
@@ -104,19 +104,23 @@ def write(
        from_pandas_kwargs["preserve_index"] = index

        table = self.api.Table.from_pandas(df, **from_pandas_kwargs)

        fs = kwargs.pop("filesystem", get_fs_for_path(path))
Perhaps we should wait until https://github.com/pandas-dev/pandas/pull/34266/files#diff-0d7b5a2c72b4dfc11d80afe159d45ff8L153 is merged, since this method is changing there. @TomAugspurger might have thoughts.
I agree that #34266 is a more robust solution, which should land in v1.1. This PR could be a patch for 1.0.x, as `to_parquet` with `partition_cols` specified is not working as designed; without `partition_cols` it works as intended, with no need to specify the file system explicitly.
We originally backported this functionality to 1.0.x, but it was then reverted in #34632, so there is currently no S3 directory functionality on 1.0.x. See the whatsnew: https://pandas.pydata.org/docs/whatsnew/v1.0.5.html#fixed-regressions. So I think this will have to wait for 1.1.
Ok, sounds good. I will alter this after #34266 is merged.
@kylase are you still interested in working on this? The whatsnew will be 1.2 now. Thanks
I have looked at the tests in 1.1, and there are tests for both an explicit filesystem and an inferred one, so I feel the issue has been resolved. Will proceed to close this PR and the issue.
Infer the file system to be passed to pyarrow based on the path provided.
- `black pandas`
- `git diff upstream/master -u -- "*.py" | flake8 --diff`